Add a random seed specific to datagen cases #9441

abellina · 2023-10-13T15:00:19Z

Please note that there are a number of seed overrides in the integration tests. I wasn't planning on changing these, but I can. I need to look into why they are seeding from 0 each time.

src/main/python/collection_ops_test.py: step_gen.start(random.Random(0))
src/main/python/collection_ops_test.py: gen.start(random.Random(0))
src/main/python/json_fuzz_test.py:_name_gen.start(random.Random(0))

src/main/python/cache_test.py: def op_df(spark, length=2048, seed=0):
src/main/python/generate_expr_test.py:def four_op_df(spark, gen, length=2048, seed=0):
src/main/python/expand_exec_test.py: def op_df(spark, length=2048, seed=0):

I did not add the random seed to the test name, but I easily can. I was leaning against it because the user can easily set the seed, so the seed in the test name wouldn't match if they have changed something internally. Every test would use the same seed, and it can be overridden via an environment variable or via the seed parameter in datagen.

The variable we can use to override this in manual runs is: SPARK_RAPIDS_TEST_DATAGEN_SEED
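
As a rough illustration of the override described above, here is a minimal sketch of how a seed could be resolved. The helper name `resolve_datagen_seed`, the precedence order (explicit argument, then environment variable, then wall clock), and the time-based default are all assumptions for illustration. Only the variable name SPARK_RAPIDS_TEST_DATAGEN_SEED comes from this PR.

```python
import os
import random
import time

# Environment variable named in the PR description.
SEED_ENV_VAR = "SPARK_RAPIDS_TEST_DATAGEN_SEED"

def resolve_datagen_seed(explicit_seed=None, environ=os.environ):
    """Pick the seed for one datagen call (hypothetical helper).

    Assumed precedence: explicit argument, then env var, then wall clock.
    """
    if explicit_seed is not None:
        return explicit_seed
    from_env = environ.get(SEED_ENV_VAR)
    if from_env is not None:
        return int(from_env)
    # Assumption: fall back to the current Unix time in seconds.
    return int(time.time())

# Simulate a manual run with the variable set, without touching os.environ.
rand = random.Random(resolve_datagen_seed(environ={SEED_ENV_VAR: "1697220957"}))
```

Passing `environ` as a parameter keeps the sketch testable without modifying the real process environment.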

Signed-off-by: Alessandro Bellina <[email protected]>

abellina · 2023-10-13T16:03:55Z

build

abellina · 2023-10-13T17:56:23Z

I have updated the test names to have the seed in the test name, here's an example:

../../src/main/python/dpp_test.py::test_dpp_via_aggregate_subquery[false-0-parquet][DATAGEN_SEED=1697219724, INJECT_OOM, IGNORE_ORDER]
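
To make the naming scheme concrete, the following sketch builds a suffix in the same shape as the example above. `annotate_test_id` is a hypothetical helper, not the conftest.py code from this PR. It only shows how the seed and marker names could be joined.

```python
def annotate_test_id(base_id, seed, markers):
    """Append the seed and marker names to a test id (hypothetical helper)."""
    extras = ["DATAGEN_SEED={}".format(seed)] + list(markers)
    return "{}[{}]".format(base_id, ", ".join(extras))

example = annotate_test_id(
    "dpp_test.py::test_dpp_via_aggregate_subquery[false-0-parquet]",
    1697219724,
    ["INJECT_OOM", "IGNORE_ORDER"],
)
```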

abellina · 2023-10-13T18:00:36Z

build

revans2 · 2023-10-13T18:12:01Z

If I could have everything I wanted I would change the datagen so the seed is not passed into it at all. Then if someone wants to override the seed they use an annotation to force it to a specific seed.

revans2

Looks good generally.

revans2 · 2023-10-13T18:12:46Z

integration_tests/src/main/python/conftest.py

@@ -113,7 +113,6 @@ def is_parquet_testing_tests_forced():
 _inject_oom = None

 def should_inject_oom():
-    global _inject_oom


Is the global not needed any more?

This is something I owed @gerashegalov from a long time ago #7925 (comment).

Since we don't need to change the variable, it's not needed.

abellina · 2023-10-13T18:22:29Z

> If I could have everything I wanted I would change the datagen so the seed is not passed into it at all. Then if someone wants to override the seed they use an annotation to force it to a specific seed.

With the argument approach, if you had multiple datagens you can set a seed for each datagen invocation. With an annotation, I'd have to check whether a function invocation can be annotated, though at that point I am not sure there's much point to that (though I could be wrong).
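
The per-invocation point can be sketched with plain `random.Random`. `gen_ints` below is a hypothetical stand-in for a datagen call, not the real data_gen.py API. Each call takes its own seed, so two generators inside one test can be pinned independently.

```python
import random

def gen_ints(length, seed):
    """Hypothetical stand-in for one datagen invocation with its own seed."""
    rand = random.Random(seed)
    return [rand.randint(0, 100) for _ in range(length)]

# Two invocations in the same test, each with a separate seed.
left = gen_ints(5, seed=1)
right = gen_ints(5, seed=2)

# Reusing a seed reproduces the same data for that invocation only.
same_as_left = gen_ints(5, seed=1)
```

A test-level annotation, by contrast, would naturally apply one seed to every datagen call in the test.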

abellina · 2023-10-13T18:32:28Z

I am seeing a failure in CI:

FAILED ../../src/main/python/arithmetic_ops_test.py::test_decimal_bround[Float][DATAGEN_SEED=1697220957, INJECT_OOM, INCOMPAT, APPROXIMATE_FLOAT]

So I'll take a look at reproing this locally.

abellina · 2023-10-16T13:48:07Z

This is another manifestation of #9350. That said, it's the only test that failed in this CI run. If we did have the parameter annotation that @revans2 suggested, we could at least print in the test that we meant to override the seed. I'll check on that. Worst case, I may skip this test since we already have a ticket for it.

abellina · 2023-10-16T14:51:31Z

BTW, the test failure is easy to repro via:

SPARK_RAPIDS_TEST_DATAGEN_SEED=1697220957 ./run_pyspark_from_build.sh -k test_decimal_bround
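
The interaction between a per-test override and the environment variable could look roughly like the sketch below. This is a plain-Python stand-in: the decorator here only records an attribute and is not the pytest marker added in this PR, and the precedence (per-test override first, then the environment variable, then a default) is an assumption.

```python
import os

def datagen_overrides(seed=None):
    """Stand-in for the PR's marker: records a seed on the test function."""
    def wrap(test_func):
        test_func._datagen_seed_override = seed
        return test_func
    return wrap

def seed_for_test(test_func, default_seed, environ=os.environ):
    """Assumed precedence: per-test override, then env var, then the default."""
    override = getattr(test_func, "_datagen_seed_override", None)
    if override is not None:
        return override
    from_env = environ.get("SPARK_RAPIDS_TEST_DATAGEN_SEED")
    return int(from_env) if from_env is not None else default_seed

@datagen_overrides(seed=12356)
def pinned_case():
    pass

def unpinned_case():
    pass
```

With this shape, a manual repro via the environment variable affects every test that has not been pinned explicitly.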

abellina · 2023-10-16T15:04:07Z

build

abellina · 2023-10-16T15:43:08Z

@firestarman mind taking a look at this diff 7fc82d6, I am not sure if the random with seed=0 was set on purpose.

abellina · 2023-10-16T15:43:14Z

build

firestarman · 2023-10-26T01:36:41Z

> @firestarman mind taking a look at this diff 7fc82d6, I am not sure if the random with seed=0 was set on purpose.

I probably just copied it from the old code and it is ok to remove it if no failures are found.

abellina · 2023-10-27T15:43:14Z

build

abellina · 2023-11-02T15:43:34Z

build

revans2 · 2023-11-03T21:41:27Z

Looks like there are still some more tests that need to be fixed/xfailed + issues filed before we can check this in.

abellina · 2023-11-03T21:43:20Z

> Looks like there are still some more tests that need to be fixed/xfailed + issues filed before we can check this in.

Yes, there's a list of issues that are failing in CI, some I can repro locally, some I can't. I should be able to get back to this next week.

abellina · 2023-11-13T22:20:12Z

Follow on issue: #9703

abellina · 2023-11-13T22:24:21Z

build

abellina · 2023-11-14T14:39:24Z

build

abellina · 2023-11-14T18:22:29Z

build

abellina · 2023-11-14T18:35:36Z

build

revans2

Generally looks good. I just want to be sure that there is a follow on issue for every test that has a hard coded seed, or a good explanation as to why it is hard coded.

revans2 · 2023-11-14T18:47:11Z

integration_tests/src/main/python/cache_test.py

@@ -91,11 +91,12 @@ def do_join(spark):
 @pytest.mark.parametrize('data_gen', all_gen, ids=idfn)
 @pytest.mark.parametrize('enable_vectorized_conf', enable_vectorized_confs, ids=idfn)
 @ignore_order
+@datagen_overrides(seed=0)


Why no reason here?

Oh whoops, let me add that. Leaving it out was not intended.

I am re-running CI with these removed. I can't repro these locally.
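
One way to make a missing reason hard to overlook is to reject a pinned seed that has no explanation attached. The sketch below is purely illustrative: `datagen_overrides_checked`, its `reason` parameter, and the `_datagen_override` attribute are hypothetical names, and nothing here is taken from the PR's implementation.

```python
def datagen_overrides_checked(seed=None, reason=None):
    """Hypothetical decorator: a hard-coded seed must carry a reason."""
    if seed is not None and not reason:
        raise ValueError("a pinned seed needs a reason, e.g. a tracking issue")

    def wrap(func):
        # Record the override so a test harness could read it later.
        func._datagen_override = {"seed": seed, "reason": reason}
        return func
    return wrap

@datagen_overrides_checked(seed=0, reason="placeholder: link to a follow-on issue")
def example_case():
    pass
```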

revans2 · 2023-11-14T18:55:09Z

integration_tests/src/main/python/generate_expr_test.py


 #sort locally because of https://github.com/NVIDIA/spark-rapids/issues/84
 # After 3.1.0 is the min spark version we can drop this
 @ignore_order(local=True)
+@datagen_overrides(seed=0)


Where is the reason for these?

revans2 · 2023-11-14T18:57:01Z

integration_tests/src/main/python/data_gen.py

    data_gen.start(rand)
    data = [data_gen.gen() for index in range(0, length)]
    return data

-def gen_df(spark, data_gen, length=2048, seed=0, num_slices=None):
+def gen_df(spark, data_gen, length=2048, seed=None, num_slices=None):


Do we want a follow on issue to remove seed for gen_df and force us to go through the annotation?
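
The effect of changing the default from `seed=0` to `seed=None` can be sketched without Spark. `gen_rows` and `TEST_WIDE_SEED` are hypothetical stand-ins for `gen_df` and for whatever seed the test run resolved. The assumption is that `None` means "use the run's seed" while an explicit value still pins the data.

```python
import random

# Example value copied from a test name earlier in this thread.
TEST_WIDE_SEED = 1697219724

def gen_rows(gen_func, length=2048, seed=None):
    """Spark-free stand-in for gen_df: seed=None means 'use the test-wide seed'."""
    if seed is None:
        seed = TEST_WIDE_SEED
    rand = random.Random(seed)
    return [gen_func(rand) for _ in range(length)]

# Default path: data follows the run's seed.
rows = gen_rows(lambda r: r.randint(-10, 10), length=4)

# Explicit path: the old behavior, pinned to seed 0.
pinned = gen_rows(lambda r: r.randint(-10, 10), length=4, seed=0)
```

Removing the `seed` parameter entirely, as the follow-on question suggests, would leave only the default path and push pinning into the annotation.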

abellina · 2023-11-14T20:37:25Z

build

abellina · 2023-11-14T20:38:33Z

build

abellina · 2023-11-14T23:10:10Z

build

abellina · 2023-11-15T14:16:18Z

build

Add a random seed specific to datagen cases

50373a9

Signed-off-by: Alessandro Bellina <[email protected]>

abellina requested a review from revans2 October 13, 2023 15:00

abellina added 2 commits October 13, 2023 12:56

Add seed to the test name

cd33b16

Fix extra global that was pending from a prior pr

f253585

abellina added the test Only impacts tests label Oct 13, 2023

revans2 previously approved these changes Oct 13, 2023

Add test marker datagen_overrides, with seed as a supported argument:…

550a90f

… @datagen_overrides(seed=12356)

abellina dismissed revans2’s stale review via 550a90f October 16, 2023 15:02

Fix typo

282af71

Remove hard coding seed in collection_ops_test

7fc82d6

abellina added 2 commits October 27, 2023 09:02

Upmerge 23.12

6b98079

Pass seed to step_gen start otherwise we get errors that _gen_func is…

8a684ce

… missing

abellina added 2 commits November 13, 2023 08:01

Merge branch 'branch-23.12' of https://github.com/NVIDIA/spark-rapids …

178c212

…into add_test_seed_for_datagen

Add datagen_overrides for tests that failed

0a7f4bb

abellina added 3 commits November 14, 2023 08:29

add more overrides

5428c8e

Upmerge

a8c3723

add override for test_datetime_roundtrip_with_legacy_rebase

6afb4bf

abellina mentioned this pull request Nov 14, 2023

Follow up from random datagen seed PR #9703

Closed

Add override for test_cast_string_ts_valid_format

5ef7bb5

revans2 reviewed Nov 14, 2023

Remove overrides without reason to try and repro in CI

e3354e4

Add another override, this time for ast_test

fdf4915

abellina mentioned this pull request Nov 14, 2023

Remove seed for gen_df and use @datagen_overrides #9712

Open

Add override for test_floor_scale_zero

c7e6322

revans2 approved these changes Nov 15, 2023

abellina merged commit e4fdd84 into NVIDIA:branch-23.12 Nov 15, 2023
36 checks passed

abellina deleted the add_test_seed_for_datagen branch November 15, 2023 17:50

jlowe mentioned this pull request Nov 16, 2023

Avoid generating null filter values in test_delta_dfp_reuse_broadcast_exchange [databricks] #9745

Merged

jlowe mentioned this pull request Dec 11, 2023

Add documentation for how to run tests with a fixed datagen seed [skip ci] #10014

Merged
